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1.  Introduction 


There  are  several  algorithms  for  dense  matrix-matrix  product  on  highly-parallel  architectures[l-5]. 
These  algorithms  are  designed  for  systolic  architectures.  When  using  a  general  purpose  highly-parallel 
machine  like  the  CHiP[6],  the  systolic  algorithm  is  directly  implemented  by  configuring  the  CHiP  to  match 
the  systolic  machine.  In  computations  that  use  matrix  product  as  one  operation  in  a  long  series  of  opera¬ 
tions,  the  matrices  to  be  multiplied  are  likely  to  be  contained  within  the  processors.  To  use  the  systolic 
algorithms,  the  matrices  must  be  circulated  as  if  they  were  being  fed  in  from  an  external  source.  This  pro¬ 
vides  an  easy  solution,  but  perhaps  not  the  most  efficient  one.  In  addition,  some  of  these  algorithms[2,3,5] 
require  the  matrices  to  be  reshaped  before  the  circulation  can  begin. 

This  paper  presents  an  algorithm  for  dense  matrix-matrix  product  that  assumes  that  the  data  is 
already  contained  within  the  processors.  The  data  does  not  follow  the  circulation  pattern  of  the  systolic 
algorithms  and  no  data  reshaping  is  required.  Also,  this  algorithm  illustrates  the  use  of  the  divide-and- 
conquer  paradigm  in  parallel  algorithm  design.  This  algorithm  has  the  advantage  that  it  can  be  imple¬ 
mented  using  n2  or  n3  processors  giving  running  times  of  0(n)  and  CHlog  n),  respectively. 

In  Section  2  the  algorithm  is  presented.  An  analysis  of  it’s  running  time  is  done  for  n2  processors. 
Section  3  describes  an  implementation  of  the  algorithm  using  the  Poker  system[5].  Section  4  describes  a 
matrix  product  algorithm  for  the  the  Wavefront  Array  Ptocessor(WAP)[4],  and  its  implementation  using 
the  Poker  system[5].  Section  5  describes  a  modified  form  of  the  WAP  algorithm  and  its  implementation.  A 
comparison  of  the  run  times  of  these  algorithms  for  several  sizes  of  matrices  is  given. 

2.  The  Algorithm 

Consider  the  product  of  two  dense  nxn  matrices,  AB  =  C,  using  n2  processors.  Assume  that  n  =  2k 
for  some  constant  k.  Picture  the  processors  as  an  nxn  array  where  the  processors  are  labeled  PE,,  for 
l£i,j£n.  The  matrices  A  and  B  are  initially  distributed  in  the  n2  processors  such  that  a,,  and  by  are  con¬ 
tained  in  PE,j.  After  the  product  we  want  c,,  to  be  contained  in  PE,r 

To  begin  with,  consider  the  2x2  case.  PEn  contains  an  and  bn  -  To  compute  cu  the  values  a!2  and 
b2i  are  needed.  Similarly,  all  other  processors  need  only  2  elements  not  already  stored  at  that  processor.  To 
provide  for  direct  communication,  a  grid  interconnection  structure  is  used.  The  processors  then  send  their 
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ay  value  to  the  other  processor  in  the  same  row,  and  their  by  value  to  the  other  processor  in  the  same 
column  (see  Figure  1).  After  this  communication,  each  processor,  PEy,  has  all  the  data  required  to  compute 
C,j. 

Now  consider  the  nxn  case.  We  use  Strassen’s[7]  matrix  decomposition  where  two  nxn  matrices  can 
be  viewed  as  two  2x2  matrices  of  y  matrices.  The  2x2  matrices  are  then  multiplied  using  matrix  pro¬ 
duct  and  matrix  addition  on  -jj-x-jj-  matrices. 

2  2 

Let  An  be  the  upper  left  -jx-2-  submatrix  of  A.  Also,  let  the  other  3  submatrices  be  At2.  A21,  and 

Ajj.  In  the  same  way,  let  By  be  the  submatrices  of  B,  Cy  the  submatrices  of  C,  and  Py  the  subarray  of  the 
processors.  Then  Ay  and  By  are  contained  in  Py.  As  in  the  2x2  case,  A12  and  B21  are  required  to  compute 
Cu.  If  the  corresponding  processors  in  Pu  and  P12  are  directly  connected  (see  Figure  2),  A!2  can  be  sent  to 
Pn  in  parallel  with  one  communication  step.  B21  can  be  sent  to  Pu  using  a  similar  connection  scheme  in 

one  communication  step.  The  full  connection  structure  connects  PE,,  with  both  PE  „  and  PE  B .  With 

1  *T 

An,  Bn.  Ai2,  and  Bu  in  Pu,  Cu  can  be  computed  by  doing  two  yx-~  matrix  products  and  one  matrix 

addition.  These  products  can  be  done  using  this  same  algorithm  on  the  -jXy  matrices.  The  recursion  will 

stop  after  k-t  levels  when  a  2x2  matrix  product  is  done.  The  matrix  addition  is  performed  element  by  ele¬ 
ment 
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Figure  1:  2x2  product  and  communication 
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FIGURE  2:  4x4  connections 

Each  recursion  level  requires  it’s  own  interconnection  structure.  The  complete  interconnection  struc¬ 
ture,  supplying  an  edge  for  every  communication  in  the  algorithm  on  every  level  of  recursion,  is  the  hyper- 


The  time  required  for  this  algorithm  is  0(n).  To  prove  this  claim,  the  recurrence  relation  for  the  time 


t(n)  =  2te+tt+t0+2t(— ) 

where  t(2)  =  2te+tt+2tTn,  15  is  the  time  for  a  communication  step,  t,  is  the  time  for  a  scalar  addition,  4  is  the 
overhead  time  for  each  recursion  level  and  ^  is  the  time  for  a  scalar  multiplication.  The  closed  form  is 

t(n)  =  (2n-2)te+<n- 1  )t,+<n-2)t0+ntm. 

To  achieve  the  O(log  n)  running  time,  we  need  to  evaluate  both  matrix  products  at  the  same 
time.  The  algorithm  starts  with  n2  processors  active.  These  processors  contain  the  original  matrices,  A  and 


B.  The  processors  communicate  as  in  die  0(n)  algomm.  At  this  point,  each  block  of  processors  has 

two  matrix  products  to  compute.  One  of  the  products  is  sent  to  a  set  of  previously  inactive  processors. 
This  doubles  die  number  of  active  processors.  The  same  algorithm  is  used  to  compute  the  "new"  products. 

After  the  -^-x-2-  products  have  been  computed,  the  processors  that  were  sent  the  second  product,  send  back 

their  result  The  results  of  the  two  products  are  added  element  by  element  to  form  the  result  of  the  nxn 
matrix  product  The  recursion  for  this  algorithm  stops  when  a  2x2  product  is  to  be  computed.  Each  pro¬ 
cessor  does  both  multiplications  and  the  one  addition. 

To  prove  the  claim  of  0(log  n)  time,  the  recurrence  relation  is 
t(n)  =  5tc+t,+t0+t(y) 

where  t(2)=3tc+t,+tni,  and  the  constants  measure  the  same  quantities  as  before,  but  for  this  algorithm.  This 
recurrence  measures  the  time  for  the  original  n2  active  processors.  The  St,,  comes  from  two  t/s  for  the  ori¬ 
ginal  communication,  two  tc’s  from  sending  one  subproblem  to  a  "new”  processor  and  a  tc  for  getting  the 
result  back  from  the  "new"  processor.  The  dosed  form  is 

t(n)  =  (S(log  n  -l)+3)tc+lognt1+(logn  -l)t#+2tm. 
n3 

This  algorithm  uses  —  processors.  It  starts  with  n2  active  processors.  After  the  initial  communica¬ 
tion,  the  n2  processors  are  divided  up  into  4,  -jx-j  sections,  each  having  two  matrix  products  to  compute. 
Every  processor  sends  two  values,  its  part  of  one  matrix  product,  to  an  inactive  processor,  thus  activating  it. 
This  doubles  the  number  of  processors.  We  now  have  8,  -^x—  problems  using  2n2  processors.  Each 

matrix  product  is  then  computed  by  a  "recursive  call".  This  is  one  recursion  level.  At  each  successive 
recursive  level  the  number  of  active  processors  is  doubled.  There  are  log  n  -1  levels  of  recursion.  This 

gives  n2'"  or  -y  active  processors  at  the  evaluation  of  the  2x2  products. 

3.  The  Poker  Implementation 

The  Poker  system[5]  was  used  to  implement  this  algorithm  although  Poker  and  its  sequential  pro¬ 
gramming  language  XX  (dos  equis)  does  not  directly  support  recursion.  The  goals  of  the  implementation 


were  to  follow  the  recursive  algorithm  as  closely  as  possible  with  the  nonrecursive  system  and  to  reduce 
the  complexity  of  the  communication  as  much  as  possible.  Recursion  was  acheived  by  explicit  manipula¬ 
tion  of  a  stack  to  save  data  and  record  the  position  within  the  recursive  algorithm.  The  communication  was 
simplified  by  dividing  it  into  a  phase  for  each  level  of  recursion. 

An  instance  of  the  algorithm  for  n  =  2“  has  log  n  =  k  phases.  Each  phase  has  a  unique  interconnec¬ 
tion  structure.  For  reference,  we  number  the  phases  1,2, ...,k.  Phase  1  connects  processors  together  that  are 


-2-  processors  away.  (See  Figure  3.)  Phase  2  connects  processors  together  that  are  -j  processors  away,  but 

only  in  blocks  of  processors.  There  are  no  connections  between  the  blocks  of  ~x*j-  processors. 
Z  Z  Z  Z 

Finally,  phase  k  connects  blocks  of  2x2  processors  with  the  grid  pattern.  (See  Appendix  B  for  die  16x16 

connections.)  In  each  phase,  a  processor  has  exactly  two  other  processors  connected  to  it.  The  port  that  is 

connected  to  the  processor  in  the  same  row  is  named  "horiz".  The  port  that  is  connected  to  the  processor  in 

the  same  column  is  named  "vert". 

A  phase  is  roughly  equivalent  to  a  recursive  call.  Phase  1  is  run  for  a  "call”  for  an  nxn  product, 
phase  2  is  run  for  a  "call"  for  an  ~x-~  product,  and  phase  k  is  run  for  a  "call”  for  a  2x2  product  The 
phases  must  be  ran  in  the  proper  order  for  a  correct  result.  We  will  discuss  the  correct  order  later. 


Figure  3:  nxn  connections 
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i  To  keep  track  of  "local  variables"  and  the  progress  through  the  algorithm,  a  stack  is  implemented.  A 

"stack  frame”  is  composed  of  three  elements.  The  top  element  of  the  frame  contains  a  tag.  This  tag,  an 
|  integer  value  of  1  or  2,  represents  which  recursive  call  this  frame  is  recording.  The  other  two  elements 

either  contain  two  values  waiting  for  a  later  multiply  or  contain  the  result  of  a  previous  multiply. 

Communication  from  phase  to  phase  is  done  using  the  feature  of  XX  that  identical  declaration  sec¬ 
tions  use  the  same  memory  allocation.  All  XX  codes  have  the  same  prefix  in  their  declarations.  (See  Fig¬ 
ures  4  and  5  and  Appendix  A.)  This  yeilds  an  "unnamed  common."  The  stack  is  part  of  this  common  area. 

The  other  variables  in  the  common  area  are  "aele",  "bele",  "othera",  "otherb”,  and  "top".  At  the  start 
of  the  a  phase,  "aele",  and  "bele"  contain  the  corresponding  values  of  A  and  B  for  the  current  product  The 
variables  "othera"  and  "otherb"  are  used  as  local  variables,  and  "top"  points  to  the  top  of  the  stack.  A 
larger  value  for  "top"  implies  a  larger  stack.  The  value  0  implies  an  empty  stack.  This  variable  is  assumed 
to  be  0  at  the  start  of  phase  1. 


code  mmii;  /*  upper  left  and  lower  right  */ 

ports  vert  horiz; 

begin 

int  vert  horiz; 

int  aele,  bele,  othera,  otherb; 

I*  aele  and  bele  are  the  entries  from  A*B,  the  result  is  left  in  aele  *1 
int  stack  [24  ]; 
sint  top; 

/*  send  values  */ 
horiz  <-  aele; 
othera  <-  horiz; 
vert  <-  bele; 
otherb  <-  vert; 

/*  multiply  */ 

/*  start  "recursion"  */ 
stack[top+l]  >  othera; 
stack[top+2]  >  otherb; 

stackftop+3]  >  1 ;  /*  this  is  first  multiple  */ 

top  >  top  +  3; 

end. 


Figure  4;  XX  code  for  processors  in  the  upper  left  and  lower  right 


Each  phase  is  divided  into  two  distinct  operations.  The  first  is  the  communication  pan.  The  second 
is  the  computation  and  bookkeeping  part  (Phase  1  to  k-l  use  the  codes  "mmii"  and  "mmij" .  Phase  k  uses 
the  codes  "mm2ii"  and  "mm2ij".  See  Figures  4  and  5  and  Appendix  A.)  For  the  communication  part,  each 
processor  sends  its  ay  (aele)  value  out  on  the  port  "horiz".  Similarly,  the  by  (bele)  value  is  sent  out  on  the 
port  named  "vert".  The  values  that  are  received  are  placed  in  the  variables  "others"  and  "otherb.” 

After  the  communication  steps,  A^B^-t-A^B^  must  be  evaluated.  Remember  that  AIt  is  the  upper 
left  quarter  of  the  2x2  matrix  of  matrices  or  scalars.  If  it  is  a  2x2  matrix  of  matrices,  the  case  for  phases  1 
to  k-l,  this  is  computed  by  doing  two  matrix  products  and  a  matrix  add.  Since  these  products  can  not  be 
done  in  parallel,  the  data  for  one  of  these  matrix  products  must  be  stacked.  In  both  the  upper  left  and  the 
lower  right,  "aele”  and  "bele"  contain  the  elements  that  need  to  be  multipled.  To  reduce  data  movement, 
we  then  stack  "others”  and  "otherb".  (See  Figure  4.)  In  the  lower  left  and  the  upper  right,  we  must  stack 
one  of  "aele"  and  "bele"  and  one  of  "othera"  and  "otherb”.  We  arbitrarily  choose  to  stack  "othera"  and 
"bele".  (See  Appendix  A.)  Since  this  is  the  first  of  the  matrix  products,  the  tag  1  is  stacked. 

After  the  the  new  stack  frame  has  been  added  to  the  stack,  it  is  time  for  the  first  "recursive  call". 
Notice  that  the  current  phase  is  completed.  We  do  not  return  to  this  phase  because  the  setup  for  the 
"second"  recursive  call  is  done  in  phase  k  (2x2  connections).  The  work  can  be  done  there  because  no  com¬ 
munication  is  required  to  set  up  for  the  "second”  recursive  call.  Also,  the  matrix  addition  is  performed  in 
phase  k  for  the  same  reason.  The  recursive  calls  are  implemented  by  running  the  next  phase.  For  example, 
the  recursive  calls  associated  with  phase  1  are  performed  by  running  phase  2. 

Consider  the  operation  of  phase  k.  Remember  that  the  communication  is  for  2x2  matrix  product  of 
scalars.  The  communication  takes  place  in  the  same  manner  as  all  other  k-l  phases.  After  the  communica¬ 
tion,  each  processor  contains  all  the  data  necessary  to  compute  the  ctJ  for  the  2x2  product  The  processors 
in  the  upper  left  and  lower  right  compute  "aele*bele+othera*otherb".  (See  Figure  5.)  The  processors  in  the 
lower  left  and  the  upper  right  compute  "aele*otherb+othera*bele".  (See  Appendix  A.)  The  result  is  placed 
in  "aele". 

At  this  point,  the  recursion  is  terminated.  We  now  simulate  the  returns.  This  is  where  matrix  addi¬ 
tion  is  done  along  with  stack  clean  up.  If  the  tag  on  the  top  stack  frame  is  2,  the  stack  frame  contains  the 
result  of  a  matrix  multiply  that  needs  to  be  added  to  the  value  in  "aele".  After  the  addition,  the  stack  frame 


code  mm2ii;  /*  upper  left  and  lower  right  */ 

pons  horiz,  vert; 

begin 

int  horiz,  vert; 

int  aele,  bele,  others,  otherb; 

/*  aele  and  bele  are  the  entries  from  A*B,  the  result  is  left  in  aele  */ 
int  stack  [  24  ]; 
sint  top; 
bool  second; 

/*  send  values  *1 
horiz  <-  aele; 
othera  <-  horiz; 
vert  <*  bele; 
otherb  <-  vert; 

I*  multiply  */ 

aele  >  aele  *  bele  +  othera  *  otherb; 

/*  clean  up  stack  and  get  ready  for  next  communication  phase  */ 
second true; 

while  second  &  (top  >  0)  do  begin 
if  stack!  top]  -  1  then  begin 
/*  only  first  multiply  has  been  done  */ 
bele  >  stack[top-l]; 

stack[top-l]  >  aele;  /*  save  result  of  first  multiply  */ 
aele  >  stack[top-2]; 

stack(top]  >  2;  /*  this  is  second  multiply  */ 
second false;  /*  need  to  start  a  new  phase  */ 
end  else  begin 
/*  finished,  just  add  */ 
aele  >  aele  +  stack[top-l]; 
top :« top  -3; 
end; 
end; 

end. 

Figure  5:  XX  code  for  2x2  case. 

is  deleted  from  the  stack.  This  is  repeated  until  the  stack  is  empty  or  the  tag  in  the  stack  frame  is  a  1 . 
When  the  stack  is  empty,  the  nxn  product  is  stored  in  "aele”. 

When  the  tag  in  the  top  stack  frame  is  a  1,  we  need  to  set  up  for  the  second  recursive  call  for  some 
recursive  level.  The  level  is  unknown.  The  actions  are  always  the  same.  The  value  in  "aele"  is  the  result  of 
the  first  product  This  needs  to  be  put  on  the  stack  during  the  second  recursive  call.  The  values  in  the  top 
stack  frame  must  be  put  into  "aele"  and  "bele".  Also,  the  tag  in  stack  frame  must  be  changed  from  1  to  2. 
With  this  done,  it  is  time  to  change  to  the  correct  phase.  This  may  be  any  of  phases  2  to  k. 
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The  {rimes  must  be  run  in  the  correct  Oder.  Since  there  is  no  automatic  control  for  running  phases 
in  Poker,  the  programmer  must  know  the  order.  For  the  nxn  product,  the  order  is  phase  1  followed  by  2 

sequences  of  phases  for  yXy  product  For  the  product  the  order  is  phase  j+1  followed  by  two 

sequences  of  the  order  for  -x~j-  product  For  example,  the  8x8  order  is  phases  1, 2, 3,  3, 2,  3,  and  3. 
After  the  execution  of  all  the  phases,  the  result  of  the  nxn  matrix-matrix  product  is  found  in  "aele". 

4.  The  WAP  Algorithm 

To  evaluate  this  recursive  algorithm,  we  chose  to  implement  and  compare  the  algorithm  for  the 
Wavefront  Array  ProcessocfW  AP)[4] .  Although  the  WAP  may  not  be  a  systolic  array,  the  matrix  product 
algorithm  has  data  flow  patterns  typical  in  systolic  algorithms. 

The  matrix  product  algorithm  for  the  WAP  uses  n2  processors  connected  in  a  nxn  grid.  The  data  is 
fed  in  along  the  top  n  processors  and  from  the  left  n  processors.  The  matrix  A  is  arranged  to  enter  column 
by  column,  starting  with  the  first  column.  The  matrix  B  is  arranged  to  enter  row  by  row,  starting  with  the 
first  row.  (See  Figure  6.)  All  processors  execute  identical  procedures.  The  result,  c,,,  is  initialized  to  zero.  A 
loop  is  executed  n  times  that  reads  an  A  value  from  the  left  and  a  B  value  from  above,  multiplies  them 
together,  and  adds  the  result  to  c,j.  The  A  and  B  values  are  sent  to  the  right  and  down  respectively.  This 
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Figure  6:  WAP  organization 
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causes  the  upper  left  processor  to  be  the  first  processor  to  start  execution.  As  the  data  moves  into  the  array, 
there  is  a  wavefront  of  executing  processors  in  a  diagonal  across  the  array.  PE^,  cannot  start  processing 
until  the  data  wavefront  reaches  it  With  this  delay  and  the  compute  time  necessary  to  compute  cm,  PE^, 


uses 

t(n)  =  (4n-4)tc+n(t,+t0)+ntm 

dine,  where  the  constants  measure  the  same  quantities  as  before,  but  for  this  algorithm.  The  recursive 
algorithm  has  a  factor  of  two  fewer  communication  steps.  This  can  have  an  impact  when  tc  is  the  largest  of 
the  constants. 

This  algorithm  was  also  implemented  on  the  Poker  system.  It  was  modified  to  start  with  the  data  in 
the  processors.  The  processors  then  circulate  the  data  using  end  around  connections  to  connect  the  last 
processor  in  a  row  or  column  to  the  first  processor  in  the  same  row  or  column.  (See  Figure  7  )  Notice  that 
the  matrix  multiply  is  correctly  computed  if  die  a’s  and  b’s  in  Figure  6  were  to  enter  the  array  in  the  reverse 
order.  The  reverse  order  is  the  result  of  a  direct  right  shift  and  down  shift  of  all  the  data  using  the  end 
around  connections. 

The  activity  at  each  processor  is  divided  up  into  two  parts.  The  first  is  to  circulate  the  data.  .See  Fig¬ 
ure  8.)  Each  processor  sends  its  a  to  the  "right"  and  its  b  "down”.  Then,  it  repeats  the  a’s  coming  in  from 


Figure  7: 4x4  connections  for  adapted  WAP  algorithm 


the  "left”  and  the  b  s  from  up".  Alter  all  die  data  has  been  passed  on,  all  that  remains  is  to  compute  its  c,r 
This  is  just  the  accumulation  of  the  a’s  times  the  b’s  in  the  order  that  they  arrive  at  the  processor.  These  a’s 
and  b's  are  sent  on  to  the  next  processor  in  the  same  row  and  column  respectively.  The  only  exception  to 
this  is  the  last  tow  and  column.  There  is  no  need  to  send  the  A  values  "right”  from  the  last  column,  or  to 
send  the  B  values  "down"  from  the  last  row.  This  gives  rise  to  three  special  codes,  (See  Appendix  A.)  one 
for  the  last  row,  the  righmost  column,  and  the  lower  right  pe,  PE^. 

5.  The  Modified  WAP  Algorithm 

The  direct  implementation  of  the  WAP  algorithm  followed  the  wavefront  behavior  of  the  original 
machine.  With  the  flexability  provided  by  then  CHiP  architecture,  we  can  do  some  preprocessing  on  the 


code  matmul  ( size );  /*  WAP,  all  except  last  row  and  column  *1 

ports  left,  right,  up,  down; 

begin 

int  aele,  bele,  cele; 
int  left,  right,  up,  down; 
int  size; 

sint  indx,  PEn,  max; 

PEn  >  size; 

/*  compute  max  of  PEi,  PEj  */ 
if  PEi  >  PEj  then 
max  >  PEi 
else 

max  >  PEj; 

/*  start  wave  around  */ 
right  <•  aele; 
down  <-bele; 

for  indx  >  2  to  max  do  begin 
if  PEj  >-  indx  then  begin  aele  <-  left;  right  <-  aele  end; 
if  PEi  >«  indx  then  begin  bele  <-  up;  down  <-  bele  end; 
end; 

/*  do  the  multiply  *1 
for  indx  >  1  to  PEn  do  begin 
aele  <-  left; 
right  <-  aele; 
bele  <-  up; 
down  <-  bele; 
cele :«  cele  +  aele  *  bele; 
end; 
end. 

Figure  8:  XX  code  for  WAP  matrix  product 
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rows  of  A  and  the  columns  of  B  so  that  all  processors  can  start  working  immediately.  Since  they  all  start  at 
the  same  time,  they  all  stop  at  the  same  time,  eliminating  the  (2n-2)te  time  required  for  the  wavefront  to 
reach  processor  PE^. 

The  first  phase  of  the  algorithm  does  the  preprocessing.  Row  i  shifts  the  elements  of  A  right  by  i-1 
processors.  Column  j  shifts  the  elements  of  B  down  by  j-1  processors.  The  shifts  assume  wrap  around,  that 
is,  processor  PE^  is  left  of  PE,[.  For  this  implementation,  it  was  possible  to  put  a  direct  connection 
between  the  source  and  destination  processors.  Figure  9  shows  the  8x8  connections.  (See  Appendix  A  for 
the  processor  codes.) 

After  the  preprocessing,  each  processor  contains  data  required  to  start  the  sum  of  products  immedi- 
atly.  The  sum  variable  is  initialized  with  the  product  of  the  current  a,k  and  b^.  The  loop  executed  by  each 
processor  sends  the  current  a*  to  the  right,  the  current  b^  down  and  then  adds  to  the  sum  variable  the  pro¬ 
duct  of  the  a,k  and  bkJ  it  just  received  from  the  left  and  above  respectively.  After  n-1  iterations,  the  loop 
terminates. 

The  time  used  to  compute  the  matrix  product  using  the  modifed  WAP  algorithm  is 
t(n)  =  t^n>+(2n-2)tc+(n-l)(ti+t0)+ntm 

where  t,(n)  is  the  time  for  the  preprocessing,  and  the  constants  measure  the  same  quantities  as  before,  but 
for  this  algorithm.  Since  t(n)  is  a  constant  for  our  implementations,  we  can  ignore  it.  For  larger  size  prob¬ 
lems,  this  cost  may  not  be  constant  For  the  16x16  implementation,  the  preprocessing  had  to  be  done  using 
two  phases  due  to  the  number  of  connections  required.  (See  Appendx  C  for  the  interconnection  structure.) 


nxn 

recursive 

(ricks) 

WAP 

(ricks) 

Mod  WAP 
(ticks) 

nJ 

(ticks) 

2x2 

2219 

6458 

4x4 

7571 

14065 

7938 

135814 

8x8 

18359 

30696 

15646 

1150950 

16x16 

40385 

70481 

31354 

9469030 

Table  1:  Running  times  on  Poker  3.0 
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6.  Experimental  Reauhs 

All  of  these  algorithms  were  implemented  far  several  values  of  n.  All  algorithms  were  run  with  two 
extra  phases.  The  first  extra  phase  loaded  the  test  data  into  the  processors  from  a  file.  The  other  extra 
phase  dumped  die  results  into  a  file.  Since  we  were  interested  in  the  performance  of  these  algorithms  start¬ 
ing  with  the  data  in  the  processors,  we  ignored  the  time  used  in  these  extra  phases.  Table  1  gives  a  sum¬ 
mary  of  the  run  times  for  these  algorithms.  One  tick  represents  one  microsecond  on  the  64  processor  Prin¬ 
gle,  the  machine  simulated  by  the  Poker  system.  For  any  size  with  64  or  fewer  processors,  these  times  can 
be  acheived  by  the  Pringle.  For  the  sizes  with  more  than  64  processors,  the  times  are  arrived  at  by  using 
the  same  ground  rules  used  by  the  Pringle. 

Because  of  the  Pringle’s  rather  antiquated  technology,  the  speed-up  comparisons  are  more  relevant 
than  the  absoulte  times.  A  1  processor  system  was  used  to  run  the  standard  3  loop,  n3  algorithm  for  all 
sizes  of  matrices.  (See  Appendix  A  for  the  code.)  The  speed-ups  listed  in  Table  2  are  using  the  sequential 
value  divided  by  the  parallel  value.  We  can  see  the  lack  of  overhead  for  the  2x2  recursive  algorithm. 
Also,  we  can  see  the  effect  of  the  constant  step  in  the  modified  WAP  algorithm. 


PE  count 

recursive 

WAP 

Mod  WAP 

4 

6.78 

2.33 

3.68 

16 

17.94 

9.66 

17.11 

64 

62.70 

37.50 

73.56 

256 

234.47 

134.35 

302.00 

Table  2:  Speed  ups 


7.  Summary 

In  designing  algorithms  for  general  purpose  MIMD  architectures,  it  is  possible  to  adapt  known  algo¬ 
rithms.  We  have  shown  that  this  adaption  may  not  produce  the  most  efficient  algorithm  on  the  new  archi¬ 
tecture.  We  have  shown  two  algorithms  that  had  better  performance  than  the  adapted  algorithm.  One  of 
these  was  a  modification  of  the  orignal  algorithm  taking  advantage  of  the  new  architecture  and  original  data 
placement  The  other  algorithm  showed  that  designing  a  totally  new  algorithm  may  yield  good  results. 
Also,  we  have  shown  that  the  divide-and-conquer  paradigm  is  useful  for  developing  new  algorithm. 
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Appendix  A:  XX  codes 
Recursive  algorithm  codes: 


code  mmij;  /*  lower  right  and  upper  left  */ 

ports  horiz,  vert; 

begin 

int  horiz,  vert; 

int  aele,  bele,  others,  exherb; 

/*  aele  and  bele  are  the  entries  from  A*B,  the  result  is  left  in  aele  */ 
int  stack  [24]; 
sint  top; 

/*  send  values  */ 
horiz  <-  aele; 
othera  <-  horiz; 
vert  <-bele; 
exherb  <-  vert; 

/*  multiply  */ 

I*  start  "recursion"  */ 
stack[top+l]  >  othera; 

$tack[top+2]  >  bele; 
bele  >  otherb; 

stack(top+3] 1 ;  /*  this  is  first  multiply  *1 
top:- top +  3; 

end. 


code  mm2ij;  /*  lower  left  and  upper  right  */ 

ports  horiz,  vert; 

begin 

int  horiz,  vert; 

int  aele,  bele,  othera,  otherb; 

I*  aele  and  bele  are  the  entries  from  A*B,  the  result  is  left  in  aele  */ 
int  stack  [  24  ]; 
sint  top; 
bool  second; 

l*  send  values  */ 
horiz  <-  aele; 
othera  <-  horiz; 
vert  <-bele; 
otherb  <-  vert; 

/*  multiply  */ 

aele  >  aele  *  otherb  *  othera  *  bele; 

J*  clean  up  stack  and  get  ready  for  next  communication  phase  */ 
second true; 

while  second  A  (top  >  0)  do  begin 


r  vLfitAL>i 
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if  stackftop]  -  1  then  begin 
/*  only  fint  multiply  hat  been  done  */ 
bek  >  stack[top-l]; 

stack[top-l] aele;  I*  save  result  of  fint  multiply  */ 
aek  >  stack[top-2]; 

stackltop] 2;  I*  this  is  secottd  multiply  */ 
second  >  false;  I*  need  to  start  a  new  phase  */ 
end  else  begin 
/*  finished,  just  add  */ 
aek  >  aek  +  stack[top-l]; 
top  :•  top  -3; 
end; 
end; 


WAP  codes: 

code  mmlow  ( size );  /*  WAP,  last  row  */ 

ports  left,  right,  up,  down; 

begin 

int  aele,  bele,  cele; 
int  left,  right,  up,  down; 
int  size; 
sint  indx,  PEn; 

PEn  >  size; 

I*  start  wave  around  */ 
right  <•  aek; 
down  <-bek; 

for  indx  >  2  to  PEn  do  begin 
if  PEj  >-  indx  then  begin  aele  <-  left;  right  <-  aele  end; 
if  PEi  >«  indx  then  begin  bek  <-  up;  down  <-  bek  end; 
end; 

/*  do  the  multiply  *1 
for  indx  >  1  to  PEn  do  begin 
aele  <-  left; 
right  <•  aek; 
bek  <-  up; 

cek :«  cele  +  aele  *  bele; 
end; 
end. 


code  mnvig  ( size );  /*  WAP,  rightmost  column  */ 

ports  left,  right,  up,  down; 

begin 

int  aek,  bele,  cele; 
int  left,  right,  up,  down; 
int  size; 


sint  indx,  PEn; 


PEn  >  size; 

/*  start  wave  arour.  ' 
light  <•  aek; 
down  <-beie; 
for  indx  >  2  to  PEn  do  begin 
if  PEj  >-  indx  then  begin  aele  <-  left;  right  <•  aele  end; 
if  PEi  >-  indx  then  begin  bele  <*  up;  down  <-  bele  end; 
end; 

/*  do  the  multiply  */ 
for  indx  >  1  to  PEn  do  begin 
aele  <•  left; 
bele  <»  up; 
down  <•  bele; 
cele  >  cele  +  aele  *  bele; 
end; 
end. 


code  mmlr  ( size );  /*  WAP,  lower  right  pe,  PEn,n*/ 

ports  left,  right  up,  down; 

begin 

int  aele,  bele,  cele; 
int  left  right  up,  down; 
int  size; 
sint  indx,  PEn; 

PEn  >  size; 

I*  start  wave  around  */ 
right  <•  aele; 
down  <-bele; 

for  indx  >  2  to  PEn  do  begin 
if  PEj  >»  indx  then  begin  aele  <-  left;  right  <-  aele  end; 
if  PEi >- indx  then  begin  bele  <•  up;  down  <-  bele  end; 
end; 

I*  do  the  multiply  */ 
for  indx  >  1  to  PEn  do  begin 
aele  <-  left; 
bele  <-  up; 

cele  >  cele  +  aele  *  bele; 
end; 
end. 


Modified  WAP  codes: 

code  aroute;  I*  first  column  except  processor  1,1  */ 

ports  >iit  tout; 

begin 


*• 

V 

§ 


int  aele,  bele; 

/*  communicate  *1 
aout  <-  aele; 
aele  <•  ain; 


code  broute;  /*  fint  row  except  processor  1,1  */ 

ports  bin,  bout; 

begin 

int  aele,  bele; 

/*  communicate  */ 
bout  <•  bele; 
bele  <-  bin; 


code  route;  /*  all  processors  except  fint  row  and  fint  column  */ 

ports  ain,  aout,  bin,  bout; 

begin 

int  aele,  bele; 

/*  communicate  */ 
aout  <-  aele; 
aele  <•  ain; 
bout  <-  bele; 
bele  <-  bin; 


code  matmul  (  PEn ); 
ports  left,  right,  up,  down; 
begin 

int  aele,  bele,  cele; 
int  PEn; 
sint  indx; 

/*  do  the  multiply  */ 
cele  >  aele*bele; 
for  indx  >  2  to  PEn  do  begin 
right  <•  aele; 
aele  <•  left; 
down  <•  bele; 
bele  <•  up; 

cele  >  cele  +  aele  *  bele; 
end; 


Sequential  code: 

code  matmul;  /*  sequential  version  for  comparison  */ 
begin 

int  aele[16,16],  bele[16,16],  cele[16,16]; 

sintsize; 

sinti,j,  k; 

/•  multiply  V 
for  i  >  1  to  size  do 
for  j  >  1  to  size  do  begin 
cele{ij]  >  aele[i,l]  *  bele[lj]; 
for  k  >  2  to  size  do 

celefij]  >  celefij]  +  aele[i,k]  *  belefkj]; 
end; 


Appendix  B:  16x16  connection! 


The  connection  structure  for  the  16x16  is  not  immediately  obvious.  For  completeness,  we  are  includ¬ 
ing  all  the  interconnection  structures  for  the  16x16  product  We  will  show  enough  of  each  structure  so  that 
the  full  16x16  can  be  reconstructed  by  use  of  replication  and  reflection. 


16x16  upper  left  comer. 


